EndoGeneAnalyzer: A tool for selection and validation of reference genes

The selection of proper reference genes is critical for accurate gene expression analysis in all fields of biological and medical research, mainly because expression can differ considerably between tissues and specimens. Given this variability, even in known classic reference genes, a comprehensive analysis platform is needed to identify the most suitable genes for each study. For this purpose, we present a tool that assists decision-making in the analysis of reverse transcription-quantitative polymerase chain reaction (RT-qPCR) data. EndoGeneAnalyzer is an open-source web tool for reference gene analysis in RT-qPCR studies that compares the groups/conditions under investigation. This interactive application offers an easy-to-use interface that allows efficient exploration of datasets. Through statistical and stability analyses, EndoGeneAnalyzer assists in the selection of the most appropriate reference gene or set of genes for each condition. It also allows researchers to identify and remove unwanted outliers. Moreover, EndoGeneAnalyzer provides a graphical interface to compare the evaluated groups, providing a visually informative differential analysis.


Introduction
Reverse transcription-quantitative polymerase chain reaction (RT-qPCR or qPCR) is a highly sensitive and specific technique used to study gene expression in many research fields, such as human disease, because of its capacity to detect rare transcripts and observe small variations in gene expression [1][2][3][4].
RT-qPCR is widely used for quantifying gene expression levels. By quantifying the RNA molecules present in a sample, RT-qPCR provides valuable insights into the expression of specific genes. To ensure accurate and reliable results, reference genes are used in the normalization process. These reference genes are stably expressed under various experimental conditions and serve as internal controls to normalize the gene expression data. Normalization with reference genes allows for a more accurate comparison of gene expression levels across different samples or experimental conditions, eliminating potential variations caused by factors such as RNA quality, sample quantity, or technical variations [5]. According to Chervoneva et al. [6], the essential criteria for choosing a good reference gene include a level of expression unaffected by experimental factors, minimal variability in expression between tissues and physiological states of the organism, and, preferably, a quantification cycle (Cq) value similar to that of the target gene. The Cq value indicates the position of the amplification curve with respect to the cycle axis [7].
According to the MIQE guidelines, the selection and the number of reference genes are essential, especially because they need to be experimentally validated for each specific sample type and study condition [8,9]. Thus, ideal reference genes should have a minimum intersubject variation in terms of quantification cycle (Cq) values, and it is recommended that this variation of the reference genes between samples be less than 1 Cq [8]. This characteristic is crucial in the data normalization process for expression comparisons, as it ensures accurate mRNA concentration measurements and reliable conclusions [10]. The normalization of data using reference genes involves correcting errors that arise from the initial concentration of RNA/cDNA, and the most common method used in RT-qPCR assays involves the use of one or more reference genes [11].
Studies have demonstrated the variability of commonly used reference genes, such as GAPDH, β2M, and 18S, under various conditions [12][13][14][15]. In aging studies, variability in GAPDH expression, when used as an internal control, interferes with the detection of subtle variations in the target genes under investigation [16]. Selecting an unstable reference gene for qPCR normalization can compromise experimental accuracy. Commonly used reference genes such as 18S, ACTB, and GAPDH may not always be suitable for this purpose and should be validated for stability in a specific study context [11]. Choosing an inappropriate reference gene leads to inaccurate normalization and misleading conclusions. This approach may also introduce variability and bias, hindering comparisons between samples [10].
In this context, algorithms have been developed to help identify the most appropriate reference genes from a given set of candidate genes. NormFinder [17], geNorm [18], BestKeeper [19], RefGenes [9], and RefFinder [20] are some software tools available for this purpose, with RefFinder being the only web-based tool currently available.
In this study, we developed the EndoGeneAnalyzer tool, available at https://npobioinfo.shinyapps.io/endogeneanalyzer/. The open-source code can be found at https://github.com/MoreiraFC/EndoGeneAnalyzer. EndoGeneAnalyzer is a dynamic web-based tool for comparing and selecting the most stable set of reference genes from a dataset derived from RT-qPCR experiments. It also integrates the NormFinder software [17]. Unlike existing algorithms, this tool allows the identification of variations by group/condition and the removal of outliers present in reference genes, a step often overlooked in gene expression studies. Furthermore, the tool can analyze the differential expression of the target genes across different groups/conditions, allowing the investigation of differences in the expression of the gene of interest and the identification of potential associations with experimental conditions.

EndoGeneAnalyzer platform
EndoGeneAnalyzer is a dynamic web-based platform that simplifies and assists in selecting reference genes in scientific studies and performing differential gene expression analysis for RT-qPCR data. With interactive interfaces and a statistical approach, the tool facilitates the identification of the best reference genes for the investigated groups or conditions. The EndoGeneAnalyzer workflow is illustrated in Fig 1. The tool has been developed to be intuitive, interactive, and efficient, with several steps to guide the user. In the first step, the user enters the data, choosing between the supported file formats: .xls/.xlsx or .txt/.csv. This flexibility makes it easy to import data from different sources. After loading the data, the user selects the targets of interest for analysis (nonreference genes). The next step focuses on evaluating the best set of reference genes based on mean variation, using descriptive statistics such as the gene standard deviation, the sum of squared differences between the mean of each group and the gene mean, and the sum of squared differences between the standard deviation of each group and the gene standard deviation. Stability metrics are also calculated using NormFinder, which helps to identify the genes that best fit the study conditions. One of the critical innovations of EndoGeneAnalyzer is the ability to analyze the stability of reference genes. In this step, the tool allows the user to identify and remove outliers, which are samples with mean ΔCq values above or below a user-defined threshold (default = 2 standard deviations), providing flexibility in the analysis.
Finally, EndoGeneAnalyzer can perform differential expression analysis using the target ΔCq and the mean ΔCq of the set of reference genes. This step allows accurate and efficient comparisons between different groups or conditions and also delivers a fold change result.

Data upload
This step is critical to ensure that correct information is used during the analysis. The input file must contain the following columns: i) the first column with the sample names; ii) the following columns with the mean Cq values of the specific targets and reference genes for each sample; and iii) the last column with information about the groups or conditions to which the samples belong.
The data can be imported in two ways: i) the Excel tables (.xls/.xlsx) option, in which the tool does not require modification of the decimal separator; and ii) the text tables (.txt/.csv) option, in which the default decimal separator is a dot (.) and the text delimiter must be configured.
Finally, after verifying the correct formatting of the table, the user needs to click on the "Confirm Data Table" button to proceed with the analysis process. This final step ensures that the tool correctly recognizes and validates the data provided.
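As a rough illustration of the table layout described in this section, the sketch below parses a small comma-delimited table with Python's standard library. The sample names, Cq values, group labels, and the helper `parse_cq_table` are all invented for illustration; this is not the code used by EndoGeneAnalyzer.

```python
import csv
import io

# Hypothetical example table: first column = sample names,
# middle columns = mean Cq per gene, last column = group/condition.
EXAMPLE_CSV = """Sample,GAPDH,TBP,TARGET1,Group
S1,18.2,25.1,27.4,Control
S2,18.9,25.4,26.8,Control
S3,21.5,25.0,24.9,Case
S4,22.1,25.3,24.1,Case
"""


def parse_cq_table(text, delimiter=","):
    """Return (genes, rows); each row is (sample, {gene: Cq}, group)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    genes = header[1:-1]  # every column between the sample and group columns
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        sample, group = record[0], record[-1]
        # float() expects a dot as the decimal separator
        cq = {gene: float(value) for gene, value in zip(genes, record[1:-1])}
        rows.append((sample, cq, group))
    return genes, rows


genes, rows = parse_cq_table(EXAMPLE_CSV)
print(genes)
print(rows[0])
```

A tab-delimited file could be read the same way by passing `delimiter="\t"`, which mirrors the need to configure the text delimiter for .txt/.csv input.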

Data summary
The selection of target (nonreference) genes guides the rest of the analysis, since these are the genes related to the research objectives. Once the target gene(s) have been identified, the user must click the "Update Target Gene" button to confirm the selection. This step ensures that the selected genes are processed and included in the analysis.

Gene reference samples
Outliers are atypical data values that can be identified in RT-qPCR data, with experimental errors being the leading cause of these occurrences. These errors are related to environmental conditions, instrument calibration problems, or other sources of uncontrolled variation that may occur during the experiment [21]. It is essential to be aware of outliers and to understand the potential impact of their removal on results.
EndoGeneAnalyzer identifies outliers per group for each gene, and they can be easily removed using an available function. By default, the tool considers a sample an outlier if its mean ΔCq for the reference gene deviates by more than 2 standard deviations, in either direction, from the mean of the group/condition to which the sample belongs. This threshold can be configured according to the user's preferences. The tool offers two methods of outlier removal: i) removal of all outliers and ii) removal of those that directly interfere with the mean Cq values of the reference genes.
The user can choose which outliers to show and remove by clicking on the "Choose which outlier to remove" radio button. "Only Mean" will show and remove only outliers in the mean of the reference genes. "All Outliers" will show and remove outliers in each gene individually. This second option tends to result in the removal of more outliers and, consequently, more samples from the analysis. Removing outliers is an iterative process, since it decreases the group standard deviation and may reveal additional outliers. In addition, the tool's dynamic interface facilitates the restoration of outliers as part of the analysis process.
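The per-group rule described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: it assumes the sample standard deviation (n - 1 denominator) and is applied to one value per sample, such as the mean of the reference genes. The function name `flag_outliers` and the data are hypothetical.

```python
from statistics import mean, stdev


def flag_outliers(values_by_sample, groups, threshold=2.0):
    """Flag samples lying more than `threshold` sample standard deviations
    from the mean of their own group (assumed rule, default 2 SD)."""
    by_group = {}
    for sample, value in values_by_sample.items():
        by_group.setdefault(groups[sample], []).append(value)
    outliers = []
    for sample, value in values_by_sample.items():
        members = by_group[groups[sample]]
        if len(members) < 2:
            continue  # the standard deviation is undefined for one sample
        mu, sd = mean(members), stdev(members)
        if sd > 0 and abs(value - mu) > threshold * sd:
            outliers.append(sample)
    return outliers


# Invented values: sample A8 sits far from the rest of group A.
values = {"A1": 20.0, "A2": 20.1, "A3": 19.9, "A4": 20.2,
          "A5": 19.8, "A6": 20.0, "A7": 20.1, "A8": 25.0,
          "B1": 21.0, "B2": 21.2, "B3": 20.9, "B4": 21.1}
groups = {name: name[0] for name in values}

flagged = flag_outliers(values, groups)
print(flagged)

# Second pass after removal: the group SD shrinks, so the check is repeated.
remaining = {s: v for s, v in values.items() if s not in flagged}
print(flag_outliers(remaining, groups))
```

Because each removal changes the group mean and standard deviation, the check is rerun on the remaining samples, which is why the process is repeated until no new outliers appear.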

Gene reference analysis
This is a crucial step in the tool's operation, as it provides information about the reference genes and their variation between the different groups or conditions studied. Significant changes in the reference genes can be observed at this stage, especially in the mean values between the analyzed groups.
The first table generated is the "Gene Reference by group", which presents information about the variation observed between the groups or conditions studied for each reference gene or for the average of the set of reference genes. The statistical tests used are the Wilcoxon-Mann-Whitney test (2 groups) or the Kruskal-Wallis/Dunn test (3 or more groups). At this stage, it is expected that there will be no significant changes (p-value > 0.05) in reference genes between the studied groups or conditions.
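To make the two-group check concrete, the sketch below computes a Wilcoxon-Mann-Whitney U statistic with a two-sided p-value from the normal approximation, using only Python's standard library. It is a simplified illustration with invented Cq values: it omits the tie and continuity corrections and the exact small-sample distribution that a statistics package would normally apply, and it is not the tool's own code.

```python
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)  # smaller of the two U statistics
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma  # z <= 0 because u is the smaller U
    return u, min(2 * NormalDist().cdf(z), 1.0)


# Invented Cq values for two candidate reference genes in two groups.
shifted = mann_whitney_u([18.1, 18.3, 18.2, 18.4, 18.0],
                         [21.0, 21.2, 20.9, 21.5, 21.1])
stable = mann_whitney_u([25.0, 25.2, 25.1, 25.3, 24.9],
                        [25.05, 25.15, 25.25, 24.95, 25.35])
print(shifted)
print(stable)
```

In this toy example the first gene differs between groups (p < 0.05) and would be a poor reference, whereas the second shows no significant difference.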
The tool also provides the "Gene Reference Descriptive Statistics" table, which presents three fundamental values for assessing the reference genes: the gene standard deviation, the sum of squared differences between the mean of each group and the gene mean, and the sum of squared differences between the standard deviation of each group and the gene standard deviation. The sums of squared differences are calculated as follows:

$$\text{sum.mean.square.diff} = \sum_{i=1}^{n} (\mu_i - \mu_g)^2$$

$$\text{sum.SD.square.diff} = \sum_{i=1}^{n} (\sigma_i - \sigma_g)^2$$

where n is the number of groups, μ_i is the mean Cq of group i, μ_g is the mean Cq of the gene, σ_i is the standard deviation of group i, and σ_g is the standard deviation of the gene.
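The three descriptive values can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the gene mean and gene standard deviation are taken over all samples pooled across groups, and the sample standard deviation is used; the tool's exact conventions may differ. The function name and the Cq values are hypothetical.

```python
from statistics import mean, stdev


def descriptive_stats(cq_by_group):
    """cq_by_group maps a group name to the list of Cq values of one gene."""
    all_values = [v for values in cq_by_group.values() for v in values]
    gene_mean, gene_sd = mean(all_values), stdev(all_values)
    # Sum over groups of (group mean - gene mean)^2
    sum_mean_sq = sum((mean(v) - gene_mean) ** 2 for v in cq_by_group.values())
    # Sum over groups of (group SD - gene SD)^2
    sum_sd_sq = sum((stdev(v) - gene_sd) ** 2 for v in cq_by_group.values())
    return {
        "Standard.Deviation": gene_sd,
        "sum.mean.square.diff": sum_mean_sq,
        "sum.SD.square.diff": sum_sd_sq,
    }


# Invented Cq values: one gene stable across groups, one shifted by 3 cycles.
stable_gene = descriptive_stats({"Control": [20.0, 20.2, 19.8],
                                 "Case": [20.1, 19.9, 20.0]})
shifted_gene = descriptive_stats({"Control": [20.0, 20.2, 19.8],
                                  "Case": [23.1, 22.9, 23.0]})
print(stable_gene)
print(shifted_gene)
```

Lower values of all three quantities point to a better reference gene, so the shifted gene above scores worse on every measure.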
In addition, NormFinder provides information on the stability and suitability of the reference genes, since this software is integrated into our tool interface. The NormFinder software employs an ANOVA-based model to account for intra- and intergroup variations. The generated table consists of four columns: the first column is the gene name, the second column (GroupDif) represents a measure of the difference between the groups, the third column (GroupSD) is the common standard deviation within a group, and the fourth column (Stability) provides the stability value. Thus, genes with lower stability values exhibit less variable expression and maintain a consistently stable expression pattern, while genes with higher stability values exhibit more variable, less stable expression [17].

Differential expression analysis
EndoGeneAnalyzer allows the comparison of gene expression differences among the investigated groups using ΔCq. ΔCq is calculated as the difference between the Cq of the target gene and the mean Cq of the reference genes. For comparisons between 2 groups, two statistical tests are integrated into the tool: the Pearson t test and the Wilcoxon-Mann-Whitney rank sum test. For comparisons between 3 or more groups, the ANOVA/Tukey's test and the Kruskal-Wallis/Dunn test are applied.
The system also performs the Shapiro-Wilk test for the normality of each group and calculates the fold change using the formula 2^(-ΔΔCq) [22]. This metric quantifies the difference in expression between two given groups, considering the relative variation in ΔCq values.
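The arithmetic behind ΔCq and the fold change can be illustrated with a short worked example. The sketch assumes that ΔΔCq is the difference between the mean ΔCq of the two groups; how the tool aggregates samples is not specified here, so this is an assumption. The function names and all Cq values are invented.

```python
from statistics import mean


def delta_cq(target_cq, reference_cqs):
    """ΔCq = target Cq minus the mean Cq of the reference genes."""
    return target_cq - mean(reference_cqs)


def fold_change(dcq_test, dcq_control):
    """2^(-ΔΔCq), with ΔΔCq = mean ΔCq(test) - mean ΔCq(control)."""
    ddcq = mean(dcq_test) - mean(dcq_control)
    return 2 ** (-ddcq)


# Invented data: (target Cq, [reference gene Cqs]) per sample.
control = [(27.0, [24.0, 26.0]), (27.0, [25.0, 25.0])]
test = [(25.0, [24.0, 26.0]), (25.0, [25.0, 25.0])]

dcq_control = [delta_cq(t, refs) for t, refs in control]  # 2.0 for each sample
dcq_test = [delta_cq(t, refs) for t, refs in test]        # 0.0 for each sample
fc = fold_change(dcq_test, dcq_control)
print(dcq_control, dcq_test, fc)
```

Here the target Cq is two cycles lower in the test group at the same reference level, so ΔΔCq = -2 and the fold change is 2^2 = 4, meaning roughly fourfold higher expression.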

Development of the EndoGeneAnalyzer web tool
The EndoGeneAnalyzer tool was developed with the Shiny framework in RStudio software (v.1.7.4, https://shiny.rstudio.com/) [23], which transforms regular R code into an interactive environment that can follow and "react" to remote user instructions. The tool is compatible with all commonly used internet browsers.

Results
To illustrate the usefulness of EndoGeneAnalyzer, we employed unpublished RT-qPCR data from our laboratory, which can be accessed on the "Tutorial" tab at the following web address: https://npobioinfo.shinyapps.io/endogeneanalyzer/. After the table submission is confirmed, the data are loaded and the first analysis panel, titled "Data Summary", becomes available. In this panel, the loaded Cq averages are displayed in a graph highlighting the conditions specified during the upload (Fig 2), which initially allowed us to observe the dynamics of the reference and target genes under the studied conditions. The user also selects the target gene for further analysis at this stage. Subsequently, as previously mentioned regarding outliers and their impact on RT-qPCR data analysis, EndoGeneAnalyzer allows these values to be handled. In the "Gene Reference Samples" panel, the user can identify and remove these values in two ways: i) remove "All outliers" or ii) remove only those outliers that affect the reference gene averages ("Only mean"), as shown in Fig 3. Furthermore, in the "Gene Reference Analysis" tab, statistical reports are generated for the selected reference genes based on ΔCq and the mean of the reference genes. These reports provide information that supports the choice of the most stable set of reference genes (Fig 4). The first report is the result of nonparametric tests, with the choice of the test being conditional on the number of groups to be compared. To select the best reference genes, it is ideal that there is no significant variation among the groups, especially in the MeanRef column. The second report provides a descriptive analysis of the reference genes (standard deviation, sum.SD.square.diff, and sum.mean.square.diff), with lower values indicating better reference gene(s). The last report contains additional analyses generated by the NormFinder software integrated into EndoGeneAnalyzer [16].
In the example provided, GAPDH was not a good internal control when used alone (p < 0.05) or in combination with ABL, TBP and RPLPO under the conditions investigated (MeanRef ≤ 0.05). However, removing GAPDH from the analysis proved to be a viable alternative for the conditions investigated (MeanRef > 0.05), given that it is a gene with high variability (Standard.Deviation = 3.35 and Stability = 1.71). In addition, since NormFinder analyses determine expression stability by assessing intra- and intergroup variation, it is possible to identify which reference gene varies between groups and exclude it from the analyses. RPLPO, TBP and ABL (stability values of 0.49, 0.90, and 0.98, respectively) were the genes with the greatest stability; that is, they exhibited a consistently stable expression pattern, suggesting that they are excellent internal controls. The user can perform differential expression analysis in the last panel, "Differential Expression Analysis", by selecting the target gene and comparison groups (Fig 5). In addition to the graphs generated based on ΔCq, fold change values, normality test results, and statistical test tables are also displayed.

Discussion
RT-qPCR is the gold standard for gene expression studies in molecular research and clinical practice. This method is prized for its rapidity, reproducibility, high sensitivity, and specificity, enabling the detection of gene expression even in low-yield samples. Additionally, RNA-seq is recognized as an increasingly utilized and potent tool for evaluating gene expression [24]. In real-time experiments, reference genes provide a comparative basis for assessing relative variations in the expression of target genes [18]. Therefore, the identification and validation of these genes are mandatory steps.
Some studies have questioned the use of conventionally established reference genes that may fail to enable the detection of subtle differences in target gene expression under certain conditions due to their high variability [25][26][27] and may lead to misinterpretation of results depending on the experimental context [28].
Outlier removal is critical for the statistical analysis of qPCR data [29], as these values can bias descriptive statistics, such as the mean and standard deviation, that are used to describe gene expression. Thus, removing outliers and identifying genes that vary with the studied conditions helps to ensure accuracy and reliability in interpreting results.
EndoGeneAnalyzer is a valuable tool for researchers conducting RT-qPCR experiments. It offers several features that improve the accuracy and reliability of gene expression studies. It enables precise data normalization, a critical step in gene expression analysis, ensuring that the results are appropriately adjusted and comparable across samples.

Conclusion
In summary, EndoGeneAnalyzer is a new tool for RT-qPCR gene expression analysis. To address the challenge of reference gene selection, we used our platform with simulated data to demonstrate the identification and assessment of suitable genes in specific study settings. Statistical analyses are combined with stability measures and outlier filtering to allow an informed selection of reference genes and to provide a stable basis for data normalization.
EndoGeneAnalyzer provides a comprehensive set of features tailored to genetic research needs, including group and variation analysis, outlier removal, user-friendly data exploration, and differential expression analysis.
EndoGeneAnalyzer promises to be a useful tool for improving the quality of gene expression experiments. The integration of these features is expected to make research more robust and reproducible.

Fig 1. EndoGeneAnalyzer tool workflow. Users import data and select target genes for analysis, and the tool allows outlier removal. Stability metrics are then calculated, the best reference gene is identified, and differential expression analysis between groups or conditions is enabled. https://doi.org/10.1371/journal.pone.0299993.g001

Fig 2. Analysis of specific reference genes. This figure provides a comprehensive side-by-side analysis of specific reference genes. Each gene is visually represented by a distinct square, allowing for a clear and concise depiction of its individual attributes. Within each square, box plots showcase the mean Cq values for each condition examined in the study. The implementation of vivid colors throughout the figure serves to effectively differentiate the diverse groups or conditions under investigation, enriching our comprehension of their unique characteristics and trends. https://doi.org/10.1371/journal.pone.0299993.g002

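Fig 4. Gene Reference Descriptive Statistics by EndoGeneAnalyzer. (1) Analysis of reference genes separated by study groups. (2) Descriptive statistics of the analyzed reference genes. (3) Results generated from the NormFinder tool. The first table displays the statistical results obtained using the Kruskal-Wallis test for each reference gene and the mean of all genes. Statistically significant values are highlighted in red. The second table presents descriptive data for each examined group, indicating lower values for superior gene expression or a more favorable set of genes. The third table shows the stability data generated by the NormFinder software, providing insights into the reliability of the identified reference genes for accurate gene expression analysis. https://doi.org/10.1371/journal.pone.0299993.g004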

Fig 5. Graphical visualization of differential expression analysis separated by target and study groups. The figure summarizes the differential gene expression among the examined groups using ΔCq values in box plots. The color-coded conditions and the tables showing the fold change values and Shapiro-Wilk normality test results provide comprehensive information. The user interface allows the selection of target genes and groups for statistical comparisons, including Pearson's t-test or the Wilcoxon-Mann-Whitney rank sum test (for two groups), and the ANOVA/Tukey's test or the Kruskal-Wallis/Dunn test (for three or more groups). This illustration facilitates the exploration of the molecular differences underlying gene expression variations between conditions. https://doi.org/10.1371/journal.pone.0299993.g005